Language Identification of Kannada Language using N-Gram
Abstract
Language identification is an important preprocessing step for natural language processing (NLP) tasks such as POS tagging, sentence boundary detection, and data mining. Kannada is an Indian language, and a lot of research is being carried out on Kannada language processing. A large share of online documents, such as websites, combine Kannada and English sentences. In this paper, we present an n-gram method of language identification for documents containing Kannada, Telugu, and English sentences. We show how performance can be improved by applying n-gram processing to only the last word of a sentence instead of the complete sentence. This method could also serve as a preprocessing step for the sentence boundary detection discussed in [1].
General Terms: Language Identification, Kannada Language.
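The idea of scoring only the last word of a sentence against per-language n-gram profiles can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not state the n-gram order, the scoring function, or the training data, so the character bigrams, the overlap score, the toy `TRAINING` sentences, and all function names here are assumptions made for the example.

```python
from collections import Counter

# Toy training sentences (hypothetical, far too small for real use).
TRAINING = {
    "en": ["I am going home", "this is a book"],
    "kn": ["ನಾನು ಮನೆಗೆ ಹೋಗುತ್ತೇನೆ", "ಇದು ಒಂದು ಪುಸ್ತಕ"],
    "te": ["నేను ఇంటికి వెళ్తున్నాను", "ఇది ఒక పుస్తకం"],
}

def char_ngrams(text, n=2):
    """Return the character n-grams of text, padded with one space per edge."""
    padded = f" {text.strip().lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_profile(samples, n=2):
    """Count character n-grams over the training strings of one language."""
    profile = Counter()
    for sample in samples:
        profile.update(char_ngrams(sample, n))
    return profile

def score(text, profile, n=2):
    """Fraction of the text's n-grams that also occur in the profile."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)

def last_word(sentence):
    """Last whitespace-separated token, with sentence-final punctuation removed."""
    tokens = sentence.strip().rstrip(".?!").split()
    return tokens[-1] if tokens else ""

def identify(sentence, profiles, n=2, last_word_only=True):
    """Pick the language whose profile best covers the chosen text span."""
    text = last_word(sentence) if last_word_only else sentence
    return max(profiles, key=lambda lang: score(text, profiles[lang], n))

# Assumed bigram profiles built from the toy data above.
PROFILES = {lang: build_profile(samples) for lang, samples in TRAINING.items()}

if __name__ == "__main__":
    for s in ["She is going home.", "ಅವನು ಮನೆಗೆ ಹೋಗುತ್ತಾನೆ.", "అతను ఇంటికి వెళ్తున్నాడు."]:
        print(identify(s, PROFILES), "<-", s)
```

Passing `last_word_only=False` scores the complete sentence instead, which makes it easy to compare the two settings on a labelled corpus. Scoring a single word also means fewer n-grams are generated per sentence.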
Similar resources
Language identification for transliterated forms of Indian language queries
Language identification has a number of applications in natural language processing. N-gram analysis has been the traditional method for language identification. In this paper, we discuss the methods and results from our participation in the Shared Task on Transliterated Search track at the Forum for Information Retrieval Evaluation, 2014. We describe a method that leverages the phonetical properti...
Addressing challenges in automatic Language Identification of Romanized Text
Due to the diversity of documents on the web, language identification is a vital task for web search engines during crawling and indexing of web documents. Among the current challenges in language identification, identifying the language of Romanized text remains an unsettled problem. The challenge in Romanized text is the variation in word spellings and sounds across dialects. We propose a Romani...
Writer Identification based on offline Handwritten Document Images in Kannada language using Empirical Mode Decomposition method
NELIS - Named Entity and Language Identification System: Shared Task System Description
This paper proposes a simple and elegant solution for language identification and named entity (NE) recognition at a word level, as a part of Subtask-1: Query Word Labeling of FIRE 2015. Given any query q1:w1 w2 w3 ... wn in Roman script, the task calls for labeling words of the query as English (En) or a member of L, where L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Kn), Malayalam (...
Automatic identification of language varieties: The case of Portuguese
Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on n-gram character models to identify the correct language of texts, with good results seen for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classifica...